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Abstract 

Ensemble forecasts of weather and climate are subject to systematic 
biases in the ensemble mean and variance, leading to inaccurate estimates 
of the forecast mean and variance. To address these biases, ensemble fore¬ 
casts are post-processed using statistical recalibration frameworks. These 
frameworks often specify parametric probability distributions for the ver¬ 
ifying observations. A common choice is the Normal distribution with 
mean and variance specified by linear functions of the ensemble mean and 
variance. The parameters of the recalibration framework are estimated 
from historical archives of forecasts and verifying observations. Often 
there are relatively few forecasts and observations available for param¬ 
eter estimation, and so the fitted parameters are also subject to uncer¬ 
tainty. This artefact is usually ignored. This study reviews analytic results 
that account for parameter uncertainty in the widely used Model Output 
Statistics recalibration framework. The predictive bootstrap is used to 
approximate the parameter uncertainty by resampling in more general 
frameworks such as Non-homogeneous Gaussian Regression. Forecasts 
on daily, seasonal and annual time scales are used to demonstrate that 
accounting for parameter uncertainty in the recalibrated predictive dis¬ 
tributions leads to probability forecasts that are more skilful and reliable 
than those in which parameter uncertainty is ignored. The improvements 
are attributed to more reliable tail probabilities of the recalibrated fore¬ 
cast distributions. 


1 Introduction 


Raw forecasts produced by numerical atmosphere-ocean models are often not 

representative of the real world. Modellers have long realised that t here are sys- _ 

temat ic discrepancies between model simulations and the real world. iGlahn and Lowry 
1972l| suggested that a linear transformation should be applied to w eather model 


forecasts to issue predictions of real world observables. Similarly, iLeith 19741 
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suggested that “a final regression step is needed” to get the “best estimate of 
the true state”. 

The past 50 years has seen a shift in focus from forecasts that are determinis¬ 
tic in nature to probability forecasts that represent the forecaster’s uncertainty 
of the future, while also providing a point forecast. To this end, ensemble fore¬ 
casts have become widely used in the fields of climate science and numerical 
weather prediction. An ensemble forecast is simply a collection of forecasts that 
differ in one or more of the model physics, resolution, or initial conditions. 

Statistical adjustment to better fit numerical model output to real world 
observations is also known as forecast recalibration. A parametric statistical 
framework is specified that takes the raw numerical model forecast as an input, 
and outputs an estimate of the real world. To avoid confusion due to overuse 
of the word “model”, the term “statistical framework” is used in place of the 
more common “statistical model”. One of the simplest statistical forecast ad¬ 
justments is the removal of a constant bias. The underlying assumption is that 
the observation is equal to the model output plus a constant offset. The offset 
is represented by an unknown parameter that must be estimated from training 
data. 

A variety of more flexible recalibration frameworks have been proposed. 
In general, the choice of the recalibration framework depends on the forecast 
quantity, and on the assumptions about the forecast errors; for example, tem¬ 
perature forecasts require different recalibration frameworks than precipitation 
forecasts. Two of the most well-kn own frameworks are Model Output Statis¬ 
tics fM OS.IOl ahn and Lowrylll972l | and Non-homogeneous Gaussian Regression 
[NGR,. [Cneiting et ah . 2005l |. MOS is equivalent to Normal linear regression of 
the observations on the model output. NGR extends MOS by allowing the pre¬ 
dictive uncertainty to depend on the ensemble spread. The more flexible the 
framework, the greater the number of parameters to be estimated. 

It is common practice to recalibrate the forecast using the estimated pa¬ 
rameters as the “correct parameter values”, i.e., by treating them as known 
constants. However, the parameters are estimated from a finite sample of train¬ 
ing data and so are also subject to uncertainty. By naively using the fitted 
parameters in issuing probability forecasts, the uncertainty in the parameter 
estimates is ignored. 

The problem of accounting for parameter uncertainty when re calibrating cli¬ 
mate and weather forecasts is not unknown. Glahn et al. 2009bl | state that the 
predictive distribution of forecasts recalibrated by linear regression should be a 
Student’s t-distribution with inflated variance. Related rem arks can be found in 
Mason and Mimmack 2002l| , ISrocker and Smithl 2008 1, and Unger et al. 2009l| . 


Forecast recalibration using dynamic li near models (Kalman filtering) also lead s 


to forecasts that follow a t-distribution Sohn et al. , 2003 , Pagowski et al. , 2006l| . 


Bayesian methods can also be used to account for paramet er uncertainty in fore¬ 


casts rec alibrated by linear re gression [Martv et al.l . l2014l| , or by latent variable 


methods Siegert et ah . 2015l| . 


This study investigates the effect of accounting for parameter uncertainty on 
the reliability and skill of recalibrated probability forecasts. Section [2] describes 
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analytic methods to account for parameter uncertainty in MOS, and proposes 
a simple bootstrap method to account for parameter uncertainty in NGR. Sec¬ 
tion [3] applies the methods developed in Section [3] to three meteorological fore¬ 
cast data sets: an annual mean temperature forecast, a seasonal forecast of the 
North-Atlantic Oscillation, and a short-range 48-hour temperature forecast. All 
three data sets demonstrate that accounting for parameter uncertainty improves 
both the skill and reliability of recalibrated forecasts. Section |4] concludes with 
a summary and discussion. 


2 Methodology 


2.1 Model Output Statistics (MOS) 

The original application of MOS was statistical downscaling, i.e., to produce 
forecasts of quantities that are not explicitl y modelled numerically, such as 


surface winds or probability of precipitation Glahn and Lowrvl . |1972|. How¬ 


ever, MOS has been widely use d for the statistical r ecalibration of both deter¬ 
ministic and ensemb le forecasts Kharin and Zwier^ . 20031 Tippett et al. . 2005 . 

2009bl| . As noted above, MOS is equivalent to Normal linear re- 


Glahn et al 


gression. The future (unknown) observation is represented by a linear function 
of the numerical model output, with Normally distributed forecast errors. Let 
yt denote the observed conditions at time t, and let rrit be the corresponding 
ensemble forecast mean. Then the MOS recalibration framework is given by 


yt = a + bmt + cst, 


( 1 ) 


where e* is a standard Normally-distributed random variable, i.e., et ^ A/’(0,1). 
For simplicity, we focus on simple linear regression. However, all the results 
presented in this paper extend easily to multiple li near regression, whe re the 
forecast mean depends on more than one input [e.g., Glahn et al.l . l200^ |. 

The parameters a, b, and c can be estimated by maximising the log-likelihood 
under the assumption that the errors et,t = 1,2, ...,7t, are independent and 
follow a standard Normal distribution. Given a training set of n ensemble 
forecast means mi,..., irin and corresponding verifying observations yi,... ,yn, 
the unbiased maximum-likelihood estimates of a, b, and are 


a = y- 


^my — 

-^rn, 

sL 



c^ = 




yt - a - bmt 




(2a) 

(2b) 

(2c) 


where rh and y denote the overall sample means of the ensemble means mi,..., m„ 
and observations yi,... ,yn in the training sample, and Smy and denote the 
sample covariance between ensemble means and observations and the sample 
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variance of the ensemble means, respectively. Note the division by n — 2 in 
Eqn. to account for the fact that two mean parameters (a and b) have been 


estimated [Draper et al.l . 119981 Chp. 1]. The fitted parameters a, b and c can 


then be used to recalibrate new forecasts for yet unknown observations. 


2.2 Non-homogeneous Gaussian Regression 


Recalibration using MOS explicitly assumes that the predictive variance is con¬ 
stant for all forecasts, i.e., equal to c^. In practice, the forecast uncertainty 
might be different on different occasions due to varying error growth rates, and 
more or less predictable weather regimes. If the numerical model can reproduce 
this variability, then there might be useful information not only in the ensemble 
mean, but also in the ensemble variance. Recalibration using NGR aims to ex¬ 
ploit systematic re lationships between t he ensemble variance and the variance 
of forecast errors Gneiting et ah . 20051 NGR]. The NGR forecast mean is a 
linear function of the ensemble mean, rrit, identical to MOS. But unlike MOS, 
the forecast variance at time f is a linear function of the ensemble variance, Vt- 
The NGR recalibration framework for the observation yt is thus 


yt ^ N {a + bnit, c + dvt) ■ 


(3) 


Note that NGR recalibration with the parameter d fixed at zero is equivalent 
to MOS recalibrat i on. 

Gneiting et al. 2005l| suggest parameter estimation for NGR by numerical 


minimisation of the Continuous Ranked Probability Score (CRPS). However, 


Williams et al.l [20I4|| found little advantage of minimum-CRPS estimation com¬ 


pared to maximum-likelihood estimation. To maintain continuity with the MOS 
parameter estimators, the NGR recalibration parameters are estimated by max¬ 
imising the log-likelihood function 


^NGR 


1 f ^ , (2/t - a - brrit) 

log (c + dvt)-\ - — - 

c -I- dvt 


(4) 


Since closed form solutions are not available for the maximum-likelihood esti¬ 
mates of the NGR parameters, numerical optimisation is used0 To ensure that 
the forecast variance is posit ive, the variance scale parameter is represented by 
d = S'^ and optimised over S Gneiting et ah . 2005l| . 


2.3 Parameter uncertainty in MOS: Analytic results 

Suppose the maximum-likelihood estimates d, 6, and c from Equations [5^ - 
[2cl are known. These parameter estimates can be used to recalibrate a new 
ensemble mean forecast m* for a yet unknown future observation y*. It is 

^Numerical optimisation is performe d using the function optim in the stats package of the 
R statistical programming environment IR Core Team! 120151 . version 3.2.0] 
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common practice to directly substitute the maximum-likelihood estimates into 
the MOS recalibration framework, so that the forecast distribution for y* is 


y 




a -|- bm*, c 


')■ 


(5) 


This recalibration f r amework has been used for probabilistic forecasting by, e.g., 
Kharin and Zwier^ 2003 1. Tippett et al. 


20051 


Results from linear regression [Draper et al.l . 119981 Chp. 1] show that the 


predictive distribution should also include parameter uncertainty. That is, the 
predictive distribution should not only include the estimated variance of the 
forecast errors c^, but should also account for the estimation uncertainty in 
the parameter estimates a, b and c^. The sampling variances of the parameter 
estimates a and b lead to an increase of the predictive variance of the Normal 
distribution. Uncertainty due to estimation of the error variance leads to a 
transformation of the Normal distribution into a Student’s t-distribution. 

When parameter uncertainty is taken into account, the predictive distribu¬ 
tion in linear regression becomes a t-distribution with inflated variance: 


y 


tn -2 d + bm*, (? 


IH-U 


[m — m) 

Trt=i {rm-mf 


( 6 ) 


The forecast variance is inflated by a term that depends on both the sample 
size n, and the distance of the ensemble forecast mean m* from the overall 
mean of the training forecasts, fh. The function denotes the non- 

standardized Student’s t-distribution with v degrees of freedom, location y and 
scale cr (see appendix). In the limit as ^ oo, the t-distribution converges to 
a Normal distribution. However, for small the tails of the t-distribution are 
heavier than those of the Normal distribution. Therefore, the variances of the 
predictive distributions given by Equations [S] and [5] differ when the sample size 
n is small, or when |m* — rh\ is large. Note that the forecast mean does not 
change. 

The difference due to the adjustment for parameter uncertainty is illustrated 
in Figured] The standard Normal distribution Af{0, 1) is compared to the non- 
standardized t-distribution ti8(0,1.1). The 18 degrees of freedom correspond to 
a sample size of n = 20, and a 10% inflated variance corresponds to a forecast m* 
that differs from the overall forecast mean of the training data by one standard 
deviation. The example can thus be considered a typical case for what one 
might expect to see with a training sample of climate forecasts. Figured) shows 
that the difference due to the adjustment for parameter uncertainty is generally 
small, and so one should not expect the differences in forecast quality to be 
substantial. 


2.4 Parameter uncertainty in NGR: The predictive boot¬ 
strap 


For NGR recalibration, Gneiting et al. 2005l| suggest substituting the parameter 
estimates a, b, c and d, as well as the ensemble mean m* and variance u*, directly 
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forecast target 

Figure 1: Illustration of difference between the standard Normal distribution 
A/’(0,1) (solid line) and the non-standardized t-distribution ti8(0,l.l) (dashed 
line). 


into Equation [3l The forecast distribution for the future observation y* is then 
given by 


y 


'Ar(. 


a + bm*, c + dv* 


(7) 


This approach has als o been used to recalibrate probability forec asts in a number 


of other studies, e.g.. iHagedorn et ^ 

However, in keeping with the discussion in Section 12.31 


2008 1 ■ Kann et al. 2009l| . 


simply substitut¬ 
ing the parameter estimates and issuing forecasts with a Normal distribution 
ignores parameter uncertainty. No analytic expression for the NGR forecast dis¬ 
tribution that accounts for parameter uncertainty has been published, to date. 
Therefore, a means of approximating the parameter uncertainty, and account¬ 
ing for the parameter uncertainty in the predictive distribution is required. The 
param e ter un certainty can be estimated non-parametrically by bootstrapping 
Efronl 19821. Generating predictive distributions using bootstrap resa mpling 


is kno wn as predictive bootstrapping, and was originally proposed by iHarrisI 
1989l| . Given a historical archive of ensemble mean forecasts mt, ensemble 
variances Ut, and corresponding observations j/t, for times t = 1,2,...,n, the 
following bootstrap resampling protocol is proposed: 


1. Generate a new training data set of size n by randomly sampling n times 
with replacement from the available pairs of historical forecasts and obser¬ 
vations; 

2. Compute the maximum likelihood NGR parameter estimates using this 
new training data set; and 

3. Repeat steps 1. and 2. K times. 
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forecast target 

Figure 2: Illustration of the predictive bootstrap to account for parameter un¬ 
certainty in NGR using a training data set of n = 19. The Normal forecast 
distribution using maximum likelihood estimates of the NGR parameters (solid 
black line), 50 Normal distributions with maximum likelihood parameter esti¬ 
mates obtained by resampling the training data set (solid gray lines), and the 
distribution obtained by the averaging the 50 bootstrap distributions (dashed 
black line). 


The fcth resampling leads to the bootstrap parameter estimates dk, bk, Ck, and 
dk- The collection of K bootstrap parameter estimates approximates the pa¬ 
rameter uncertainty distribution. 

The objective is to generate a recalibrated forecast for the unknown value y* 
of the observation, using the ensemble mean forecast m* and ensemble variance 
V*. Each set of bootstrapped parameter estimates {dk,bk, Ck, dk),k = 1,2,K 
leads to a Normally-distributed forecast given by Equation [71 The K bootstrap 
samples are combined into a single predictive distribution by calculating the 
equally weighted average over the individual Normal distributions, thereby pro¬ 
ducing a Normal mixture distribution (see appendix). The bootstrap forecast 
distribution itself is therefore not a Normal distribution. 

The predictive bootstrap is illustrated in Figure [7] The effect of averaging 
over the bootstrapped Normal distribution is similar to the adjustment for pa¬ 
rameter uncertainty in MOS: The variance of the distribution is increased, and 
the tails are made heavier. Unlike MOS, the mode of the predictive distribu¬ 
tion can be different after accounting for parameter uncertainty. By exploring 
different values for the estimated slope parameter b, bootstrapping also inflates 
the forecast variance when the forecast mean m* is far from the mean of the 
training data to. 

The assumptions underlying the predictive bootstrap are quite different from 
those of the analytic method in Section 12.31 Both approaches assume that the 
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training data represent independent and identically distributed samples (con¬ 
ditional on the parameters) from some unknown distribution. However, the 
bootstrap approximates the distribution of the parameters, without making 
prior assumptions about the form of the distribution, e.g.. Normal, Student t, 
etc.. Therefore, bootstrapping is useful for frameworks such as NGR, where 
parametric inference is difficult or impossible. 


2.5 Forecast verification: CRPS, Ignorance, PIT histogram 


This section outlines the forecast verification measures used in this paper to 
assess the quality and improvements of probabilistic forecasts. Only the general 
forms of the verification measures are provided here. Equations for specific 
distributions can be found in the appendix. 

The I gnorance score is a verifica tion score to evaluate probability density 
forecasts [Roulston and Smithl . 20021. If the forecast probability density func¬ 
tion (pdf) is f{x), and the verifying observation is y, then the Ignorance score 
is given by 

*5»^(/,J/) =-log2/(y), (8) 


i.e., the negative logarithm of the forecast density evaluated at the observation. 
The Ignorance score is a local score, i.e. it only depends on the value of the 
forecast assigned to the verifying observation. If the basis 2 is used. Ignorance 
differences are measured in bits. An Ignorance difference of A > 0 bits between 
forecast A and forecast B implies that forecast B has assigned 2^ times more 
probability density to the verifying observation than forecast A. Forecast B, 
having lower Ignorance score than forecast A, can thus be considered to be the 
“better” forecast. Time-averaged Ignorance differences are used as summary 
measures of relative forecast performance. 

The Continuous Ranked Probability Score (CRPS) is designed to evalu- 
ate a cumulative forecast distribution ( cdf) F(x) for a scalar observation y 
Matheson and Winkler , 19761 Hersbacll l2000l| . The general form of the CRPS 


IS 


/ + 00 

[F{x) - H{x - y)f dx, 

-OO 


(9) 


where H{x) is the Heaviside step-function, i.e., H{x) =0ioi x < 0 and H{x) = 1 
otherwise. Unlike the Ignorance score, the CRPS is a non-local score. The 
CRPS depends not only on the value assigned to the observation, but also on 
how much forecast probability is concentrated near the observation, i.e., the 
CRPS is sensitive to the distance of the bulk of the forecast distribution from 
the observation. For a deterministic forecast (i.e. where F{x) is itself a step 
function), the CRPS is equal to the absolute difference between forecast and 
observation, and the CRPS vanishes for a perfect deterministic forecast where 
F{x) = F[{x—y). Therefore, the CRPS can be interpreted as a distance measure 
between forecast and observation, and a lower CRPS value can be taken as an 
indication of a “better” forecast. Note that the notion of “better” depends on 
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the score, e.g., forecast A can perform “better” than forecast B in terms of the 
Ignorance score, but “worse” in terms of the CRPS. 

The CRPS is measured on the scale of the forecast target x. The Continuous 
Ranked Probability Skill Score (CRPSS) provides a dimensionless measure of 
the difference in forecast performance. If the time-averaged CRPS of forecasts 
A and B are CRPSa and CRPSb, the relative improvement of forecast B over 
forecast A is given by 


CRPSS= 


CRPSa - CRPSb 
CRPSa 


( 10 ) 


A positive (negative) CRPSS indicates an improvement (deterioration) of fore¬ 
cast B compared to forecast A. CRPSS close to zero indicates that the forecasts 
are equally good. The CRPSS is bounded above at unity for a perfect forecast 
B, but has no lower bound. 

Probabilistic forecasts should be issued such that the verifying observations 
behave like random draws from the forecast d istributions. Such for ecasts are 
referred to as being reliable or well-calibrated Gneiting et al.L l2007 |. For reli¬ 
able forecasts, the observations fall on average equally often between the 0 and 
5 percentiles, between the 5 and 10 percentiles, etc., of the forecast distribution. 
Therefore, counting how often each percentile interval is occupied by the obser¬ 
vation provides a simple test of forecast reliability. In practice, the probability 
integral transform (PIT) of each forecast pdf ft{x) is calculated, which is the 
mass of forecast probability below its verifying observation: 


/ yt 

ft{x)dx. 

-OO 


( 11 ) 


For example, if the observation yt falls between the 5th and 10th percentiles 
of the forecast density /t(x), then the value of pitt will be between 0.05 and 
0.10. Since each percentile interval should be equally likely on average for a 
reliable forecast, reliability can be checked by observing the shape of the PIT 
histogram. Reliable forecasts have a flat PIT histogram, and a non-flat PIT his¬ 
togram indicates unreliable forecasts. In general, U-shaped histograms indicate 
underdispersed forecast distributions, i.e., overconfident forecasts. Conversely, 
n-shaped histograms suggest overdispersed forecast distributions (underconfi¬ 
dent forecasts), w hile slop i ng hi stograms are indicative of a systematic bias of 


the forecast mean Hamill. 20011 


3 Results 


3.1 CanCM4 gridded annual temperature forecasts 

The effect of accounting for parameter uncertainty was evaluated in forecasts of 
near-surface (2m) temperature generated by the fourth versio n of the Canadian 
coup led ocean-atmosphere general circulation model [CanCM4: lMerrvfield et ah . 
2011j . Forecast ensembles of 10 initial condition members were initialized on 
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1 January every year between 1960 and 2010. The forecast target was the 
mean temperature averaged over the first 12 months after initi alization. Ver¬ 
ifying observations were taken from the HadCRUTSv data set iBrohan et al 


2006j . Only grid boxes with a complete observational record were included in 


the analysis. The conclusions were not found to be sensitive to the choice of 
verification data; compa rable results wer e obtained when verifying against the 
ERA-Interim reanalysis Dee et all 2011 and all grid boxes. MOS-recalibrated 
probability forecasts were computed for each year in the period 1991-2010. It 
was found that the ensemble variance was not a skilful predictor of forecast 
errors, and so there was no advantage to using NGR recalibration. Forecasts 
were recalibrated using only information available up to the time of the forecast 
and evaluated out-of-sample. The recalibration parameters for each forecast 
were estimated using the previous 25 years of forecasts and observations, after 
linearly detrending both data sets. 




Probability integral transform 


Probability integral transform 


Figure 3: PIT histograms of recalibrated CanCM4 forecasts (ja|) before, and (|bj 
after accounting for parameter uncertainty. Before accounting for parameter 
uncertainty, the observations fall into the tails of the forecast distribution too 
often. 

The PIT histogram of the recalibrated forecasts without parameter uncer¬ 
tainty (Equation [S]) appear U-shaped, indicating underdispersed probability 
forecasts f Figure l3al). As a result of the underdispersion, the 90% prediction 
intervals cover the verifying observations only 81% of the time. Therefore, the 
forecasts without parameter uncertainty are not well calibrated. In contrast, 
the PIT histogram of the recalibrated forecasts after accounting for parameter 
uncertainty (Equation IH]) is almost completely flat (Figure [3b|. Each 5% inter¬ 
val of the predictive distributions covers the verifying observation roughly 5% of 
the time, and the 90% prediction intervals cover the verifying observations 89% 
of the time. The recalibrated forecasts including parameter uncertainty appear 
to be well calibrated. 

The time-averaged difference in the Ignorance score between the recalibrated 
forecasts before and after accounting for parameter uncertainty is positive at 
most grid boxes (Figure l4fej) . Positive Ignorance differences indicate that the 
forecasts that account for parameter uncertainty assign more probability density 
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CRPSS 


Figure 4: Improvement in overall skill of the recalibrated CanCM4 forecasts 
after accounting for parameter uncertainty, measured by (jaj) the difference in 
the Ignorance score, and o the CRPSS. Positive score differences / skill scores 
indicate improved average scores after accounting for parameter uncertainty. 


to the observations than those that do not, i.e., are more skilful. 

The CRPSS is also positive in the majority of grid boxes, supporting the 
conclusion that the forecasts that account for parameter uncertainty are on 
average more skilful (Figure 30. 


3.2 Met Office seasonal NAO forecasts 


This section presents a case study on seasonal climate forecasts using a more 
complicated recalibration framework. The data consist of 20 years of historical 
seasonal ensemble forecasts of the North Atlantic Oscillation (Figure [5|). The 
ensembles wer e produced by the UK M et Office Global Seasonal prediction sys¬ 


tem GloSeab [MacLachlan et al.l . 12014] . The forecasts were initialised using a 


lagged initialisation around 1 November each year between 1992 and 2011. The 
forecast target is the average North Atlantic Oscillation (NAO) index between 
December and February (DJF), measure d as a pressure diff erence between sta¬ 


tions situated in the Azores and Iceland Scaife et ah . 2014] . 


The sample correlation coefficient between the ensemble means and the ob¬ 
servations is 0.62. The correlation between the ensemble standard deviation and 
absolute error of the ensemble mean is also high at 0.45 (or 0.3 when the very 
influential year 2010 is excluded). Previous studies found that the skill of these 
forecasts ca n be improved by linear transformations of both the ensemble mean 
and spread Fade et al llml. Therefore, NCR recalibration was used to allow 
for linear adjustment not only of the predictive mean, but also the predictive 
variance. Due to the small sample size, forecasts were evaluated by leave-one- 
out cross-validation, i.e., each forecast was recalibrated using the other 19 as 
the training set. Due to the long time scale under consideration, each forecast 
occasion can reasonably be assumed to be independent, and so the leave-one-out 
approach is justifiable. 

Figure [5] illustrates the predictive distributions for the year 1997, issued as 
a Normal distribution with the maximum likelihood NCR parameter estimates, 
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1995 2000 2005 2010 


Figure 5: GloSeaS NAO ensemble forecasts (small circles), ensemble mean fore¬ 
casts (large filled circles) and verifying observations (black diamonds) plotted 
over the verification year. 



Figure 6: NAO observations (black markers) and their NGR predictive distri¬ 
butions using the ensemble data shown in Figure [5j Distributions without pa¬ 
rameter uncertainty (white boxes) and with parameter uncertainty (gray boxes) 
are depicted by box-and-whiskers plots, where boxes indicate the inter-quartile 
range and median, and whiskers extend from the 1 to the 99 percentile. 
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Figure 7: Ignorance scores of the forecast distributions shown in Figure [HI with¬ 
out accounting for parameter uncertainty (open circles) and after accounting 
for parameter uncertainty (full circles). 


and issued as a mixture of Normal distributions based on 500 bootstrap repli¬ 
cates. The Normal distributions generated from the individual bootstrap resam¬ 
ples vary considerably, indicating large parameter uncertainty. The variance of 
the averaged bootstrap distribution is larger, and the tails are heavier than 
for the Normal distribution that uses only the maximum-likelihood parameters. 
Figure [6] shows the recalibrated predictive distributions for the years 1993-2012, 
with and without parameter uncertainty, and their verifying observations. The 
forecast distributions with parameter uncertainty have slightly heavier tails, and 
their medians are on average slightly closer to the climatological mean. 

Figure [7] shows the Ignorance scores of the individual forecasts. Whenever 
the observation falls close to the bulk of the forecast distributions, the Ignorance 
scores for the two distributions are very similar. There are three cases where 
the observation falls well into the tail of the forecast distributions (1994, 2005 
and 2010). For these tail events, the Ignorance score improves when parameter 
uncertainty is taken into account. The predictive bootstrap leads to heavier 
tails, and thus to higher probabilities being assigned to “unexpected events”. 

The results of the case study presented in this Section suggest that the 
predictive bootstrap increases the forecast skill of recalibrated GloSea5 NAO 
forecasts. However, only 20 data points are considered, which does not allow 
for a robust statistical analysis of the forecast skill. A larger dataset of numerical 
weather predictions is analysed in the next Section. 

3.3 NCEP short range temperature forecasts 

Daily forecasts of near-surface (2m) temperature with a 48 hour lead time 
were also analysed. The forecasts were taken from version 2 of the refore- 
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cast p roject, hosted by th e National Oceanic and Atmospheric Administration, 


USA Hamill et all [201^ . The ensemble forecasts are approximately equiva¬ 


lent to those issued by the operational global ensemble forecasting system of 
the national centre for environmental prediction (NCEP). Ten member initial 
condition forecasts were issued at 00 UTC each day for a grid point close to 
New York City, USA (40N, 74W). The forecasts covered the period 26 May 
1990 - 15 September 2014, giving a total of n = 8,879 forecasts and verifying 
observations. The analyses (i.e., the control forecast at 0 lead time) were used 
as verifying observations. 

Preliminary investigations showed that NCR recalibration yielded more skil¬ 
ful forecasts than simple MOS recalibration. Therefore, the NCR recalibrated 
forecasts were used throughout. Recalibration parameters were estimated sep¬ 
arately for all n forecasts, using data from a rolling training window of pre¬ 
specified size w. That is, for each forecast, the w previous pairs of forecasts and 
observations were used as training data, such that all forecasts were evaluated 
out-of-sample. The rolling training window allows the recalibration scheme to 
adapt to non-stationarities in the hindcast data, such as the updating of the 
data assimilatio n scheme in 2011, or the dependency of the forecast bias on the 
time of the year Hamill et al.l . 20131. Parameter uncertainty was accounted for 
using the bootstrap approach described in Section 12.41 50 bootstrap replicates 
were used for each forecast. Increasing the number of bootstrap replicates was 
found to provide only very small improvements in forecast skill, at considerable 
computational expense due to the large sample of forecasts to be evaluated. 



Figure 8: (jaj) Ignorance scores, and (jb| CRPS as a function of training sample 
size for the recalibrated NCEP forecasts before (open circles) and after (filled 
circles) accounting for parameter uncertainty. 

Figure l8ln shows the Ignorance scores of the NGR-recalibrated forecasts as a 
function of the size of the rolling training window, before and after accounting 
for parameter uncertainty. The improvements in forecast skill produced by ac¬ 
counting for parameter uncertainty are evident for both small and large training 
samples. As expected, the scores converge for very large training datasets. It is 
encouraging that the predictive bootstrap yields improved forecasts even in rel¬ 
atively data-rich settings with hundreds of historical forecasts and observations. 
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where one might expect the effect of parameter uncertainty to be negligible. 
The CRPS values shown in Figure [HQ are qualitatively similar to those of the 
Ignorance scores. The optimum training period after accounting for parameter 
uncertainty is 50 days for both scores. Before accounting for parameter uncer¬ 
tainty, the optimal training period for CRPS is similar (60 days). However, the 
optimal training period for the Ignorance score is 400 days without account¬ 
ing for parameter uncertainty. For CRPS, the effect of the training length is 
large compared to the effect of parameter uncertainty, while the effects are of 
comparable magnitude for the Ignorance score. 



Figure 9: PIT histograms of recalibrated NCEP forecasts (jaj) before, and © 
after accounting for parameter uncertainty. Before accounting for parameter 
uncertainty, the forecasts show evidence of underdispersion and bias. After ac¬ 
counting for parameter uncertainty, the forecasts appear slightly overdispersed. 

Fjgures 1^ and Ibl show PIT histograms after recalibration based on a rolling 
training period of 50 days, before and after accounting for parameter uncer¬ 
tainty. The overpopulation of the outer bins of the PIT histograms when pa¬ 
rameter uncertainty was not accounted for is indicative of forecast distributions 
whose tails are too light. Accounting for parameter uncertainty by bootstrap 
resampling results in PIT histograms that are closer to being uniform. Fig¬ 
ure I3Q suggests that the resampling method has slightly overcompensated for 
the light tails, thus leading to overdispersive forecasts. In both cases, the PIT 
histograms of the bootstrap forecasts suggest remaining forecast biases, which 
might be an indication of a poor fit of the NGR recalibration framework to the 
data. Refinements of the underlying NGR framework are beyond the scope of 
this paper. 


4 Discussion and conclusions 

Parameter estimates in forecast recalibration frameworks are subject to uncer¬ 
tainty, particularly when estimated with small training samples. The effect of 
parameter uncertainty on the reliability and skill of probability forecasts has 
received little attention in the climate and meteorology literature. This study 
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has presented two methods of accounting for parameter uncertainty in recali¬ 
brated forecasts. Analytic results are available for MOS recalibration. Parame¬ 
ter uncertainty in more complex recalibration frameworks can be estimated by 
bootstrapping. The results presented here demonstrate that accounting for pa¬ 
rameter uncertainty can improve the reliability and skill of recalibrated forecasts 
across a range of time scales. 

The examples demonstrated here are representative of findings across a num¬ 
ber of forecast models, variables and time scales. In some cases, accounting 
for parameter uncertainty does not improve forecast reliability and skill. For 
example, accounting for parameter uncertainty did not improve seasonal aver¬ 


age E uropean temperature forecasts from the ECMWF Systemd [Molteni et al 


201 ij at a lead time of 3 months. The PIT histogram did not change, and the 


verification scores improved at only 50% of grid boxes, leaving the average fore¬ 
cast skill unchanged. However, no cases were identified where accounting for 
parameter uncertainty leads to forecasts that are less reliable or less skilful on 
average. 

The main effects of accounting for parameter uncertainty are the inflation of 
the forecast variance, and an increase in the weight of the tails of the forecast 
distribution. The wider, heavy tailed forecast distributions improve reliability 
by generating forecast distributions that are less underdispersive. This can be 
seen from the PIT histograms, which are less U-shaped after accounting for 
parameter uncertainty. Overall forecast skill is also improved, but the size of 
the improvement depends on the score. The Ignorance score is more sensitive to 
low-probability events than the CRPS. Since the main effect of accounting for 
parameter uncertainty is to improve the tails of the forecast distributions, the 
relative improvement in the Ignorance score is larger than that of the CRPS. 

The effect of accounting for parameter uncertainty is largest for small train¬ 
ing samples. The amount of training data can be limited by the available obser¬ 
vational record, by strong temporal and spatial correlations in the data, or by 
the computational expense of generating long hindcast experiments. However, 
small training samples are often deliberately chosen even in data-rich situations 
such as weather forecasting. The use of a rolling training window allows the 
recalibration to adapt to changes in background conditions, or changes to the 
forecast model. Alternative methods might be considered so that all prior data 
are included in the training sample but with decreasing weight given to older 
data. Another possibility would be to use the idea of analogues, and only cal¬ 
ibrate using prior data that are similar to conditions observed at the time the 
forecast is initialized. 

Analytic expressions for the forecast uncertainty are possible for some more 
complex recalibration frameworks. The predictive bootstrap is easily applicable 
to almost any recalibration framework. Bootstrapping can be modified to ac- 
count for temporal depende nce between the training data by block resampling 


Davison and Hinklevl . I1997L Chp. 8]. Alternative methods of estimating the 


parameter uncertainty include parametric bootstrapping, and asymptotic ap¬ 
proximations of the parameter uncertainty. Bayesian methods lead to similar 
analytic results in the case of MOS recalibration, and computational Bayesian 
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techniques can be used for more complex frameworks. 

This study has demonstrated that accounting for parameter uncertainty in 
probability forecasts leads to measurable improvements in both reliability and 
skill. Other researchers and practitioners are encouraged to investigate and 
adopt the methods proposed here, and to develop alternative methods for more 
complex recalibration frameworks. 
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Appendices 

The non-standardized t-distribution 

The pdf of the non-standardize d t-distribution wit h loca tion /i, scale cr and 
degrees-of-freedom v is given by West and Harris(^ 19971 




p (I) 


. , 1 jx - fj-Y 

V 


( 12 ) 


Normal mixture distribution 

Let p(x) and $(a;) denote the pdf and cdf of the standard Normal distribution 
respectively. If the forecast pdf j{x) is a mixture of K Normal distributions with 
weights wi, • • • , 10K (non-negative and summing to one), means • • • , pk and 
variances (Ti , • • • , cr^, then the forecast pdf /(x) is given by 


/(a;) 


K 

fe=i 


1 

—ip 

O'k 


/ X- flk \ 

V 0-fc y ’ 


and the forecast cdf F{x) is 


F{x) 


K 

k=l 


/ X- fj,k \ 
\ (Jk J 


(13) 


(14) 
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The Ignorance Score 

If the forecast pdf /(x) is Normal with mean fi and variance cr^, then the 
Ignorance score is given by 


ignif, y) 


i log (27rcr2) 


jy-gf 

2a2 


/log 2. 


(15) 


For a mixture of Normals, simply take the negative logarithm of Equation [T^ 
after substituting appropriate values for the forecast means and variances and 
for the observation. 

If the forecast pdf f{x) has the form of a non-standardized t-distribution, 
then the Ignorance score is given by 


i9n{f,y) 


-logF 


iy+1 


logF + ^log 




/log 2. 


(16) 


The Continnous Ranked Probability Score 

If the forec ast cdf F(x) is Normal with mean y and variance a^, then the CRPS 
is given by [Gneiting et al.l . [2005 1 


crps{F, y) = a 


y-y 


a 


24.1 


(IT) 


Grimit et alJ 2Q06|| showed that the CRPS of a mixture of Normal distributions 
is given by 


K 


CTps{F,y) = — ^uJkA{y - yk.(Tl) 




K K 


WfcWiA (/ife - 




( 18 ) 


k^l1^1 


where 


^ (a*, cr^) = 2crv? + /i (^2$ 0^ - 1^ 


(19) 


Analytical results for the CRPS of other distributions are difficult to derive. 
For example, no result has been published for the CRPS of the t-distribution, 
which appears as a predictive distribution in EquationlB] Numerical integration 
can be used where analytic results do not exist. The CRPS of forecasts issued 
as t-distributions was calculated using the function integ rate as implemented 
in package stats of the R statistical computing software R Core Team! . 12015 . 
version 3.2.0]. 
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Probability Integral Transformations 

If the forecast pdf ft{x) is Normal with mean Ht and standard deviation at, 
then the PIT is given by 


= ( 20 ) 

The PIT of a weighted sum of Normals is equal to the weighted sum of the PITs 
of the individual Normals. 

If the forecast pdf /t(x) is issued as a non-standardized t-distribution with 
location fit, scale at, and vt degrees-of-freedom, then the PIT is given by 

, ( 21 ) 

where T,^{x) is the cdf of the central t-distribution with 1 / degrees-of-freedom. 
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